# Import Data
library(readr)
library(knitr)      # kable()
library(car)        # Boxplot(), scatterplotMatrix(), avPlots(), qqPlot(), vif(), ...
library(RcmdrMisc)  # Hist()
White_wines <- read.table("~/Desktop/Big Data/Regression-1/White_wines.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
#View(White_wines)
Look at a summary of the data.
summary(White_wines)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
kable(summary(White_wines), format = "markdown")
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 3.800 | Min. :0.0800 | Min. :0.0000 | Min. : 0.600 | Min. :0.00900 | Min. : 2.00 | Min. : 9.0 | Min. :0.9871 | Min. :2.720 | Min. :0.2200 | Min. : 8.00 | Min. :3.000 | |
| 1st Qu.: 6.300 | 1st Qu.:0.2100 | 1st Qu.:0.2700 | 1st Qu.: 1.700 | 1st Qu.:0.03600 | 1st Qu.: 23.00 | 1st Qu.:108.0 | 1st Qu.:0.9917 | 1st Qu.:3.090 | 1st Qu.:0.4100 | 1st Qu.: 9.50 | 1st Qu.:5.000 | |
| Median : 6.800 | Median :0.2600 | Median :0.3200 | Median : 5.200 | Median :0.04300 | Median : 34.00 | Median :134.0 | Median :0.9937 | Median :3.180 | Median :0.4700 | Median :10.40 | Median :6.000 | |
| Mean : 6.855 | Mean :0.2782 | Mean :0.3342 | Mean : 6.391 | Mean :0.04577 | Mean : 35.31 | Mean :138.4 | Mean :0.9940 | Mean :3.188 | Mean :0.4898 | Mean :10.51 | Mean :5.878 | |
| 3rd Qu.: 7.300 | 3rd Qu.:0.3200 | 3rd Qu.:0.3900 | 3rd Qu.: 9.900 | 3rd Qu.:0.05000 | 3rd Qu.: 46.00 | 3rd Qu.:167.0 | 3rd Qu.:0.9961 | 3rd Qu.:3.280 | 3rd Qu.:0.5500 | 3rd Qu.:11.40 | 3rd Qu.:6.000 | |
| Max. :14.200 | Max. :1.1000 | Max. :1.6600 | Max. :65.800 | Max. :0.34600 | Max. :289.00 | Max. :440.0 | Max. :1.0390 | Max. :3.820 | Max. :1.0800 | Max. :14.20 | Max. :9.000 |
This dataset is composed of 12 variables. The dependent variable of interest is quality. We will investigate the relationship between quality and the remaining 11 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol).
Quality appears to be approximately normally distributed, with scores ranging from a minimum of 3 to a maximum of 9, a mean of 5.88, and a median of 6.0. A boxplot of quality shows potential outliers. These should be considered when interpreting the remainder of the analysis.
#Figure 1: Histogram of Quality
with(White_wines, Hist(quality, scale="frequency", breaks="Sturges",
col="darkgray"))
#Figure 2: Boxplot of Quality
Boxplot( ~ quality, data=White_wines, id.method="y")
## [1] "252" "254" "295" "446" "741" "874" "1035" "1230" "1418" "1485"
## [11] "775" "821" "828" "877" "1606" "18" "21" "23" "69" "75"
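To illustrate why so many observations are labelled, the sketch below applies the usual 1.5 * IQR boxplot rule by hand in base R. The `quality` vector is simulated stand-in data, not the wine data. Assuming the hinges equal the quartiles reported in the summary above (5 and 6), the same rule would put the fences at 3.5 and 7.5, so every score of 3, 8, or 9 would be flagged.

```r
# Simulated stand-in for White_wines$quality (assumption: integer scores 3-9)
set.seed(1)
quality <- sample(3:9, size = 500, replace = TRUE,
                  prob = c(0.01, 0.03, 0.30, 0.45, 0.18, 0.025, 0.005))

# Hinges (the quartiles a boxplot uses) and the 1.5 * IQR fences
hinges <- fivenum(quality)[c(2, 4)]
fences <- hinges + c(-1.5, 1.5) * diff(hinges)

# Values outside the fences are the points a boxplot draws individually
flagged <- quality[quality < fences[1] | quality > fences[2]]
table(flagged)
```

Seen this way, the flagged points are the rare very low and very high scores rather than data errors.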
The distribution of the remaining variables can be seen in the histograms below. Residual sugar, alcohol, and volatile acidity appear to have right-skewed distributions. The other variables, while not perfectly symmetric, appear approximately normal.
#Figure 3: Histogram of Alcohol
with(White_wines, Hist(alcohol, scale="frequency", breaks="Sturges",
col="darkgray"))
#Figure 4: Histogram of Chlorides
with(White_wines, Hist(chlorides, scale="frequency", breaks="Sturges",
col="darkgray"))
#Figure 5: Histogram of Citric Acid
with(White_wines, Hist(citric.acid, scale="frequency", breaks="Sturges",
col="darkgray"))
#Figure 6: Histogram of Density
with(White_wines, Hist(density, scale="frequency", breaks="Sturges",
col="darkgray"))
#Figure 7: Histogram of Fixed Acidity
with(White_wines, Hist(fixed.acidity, scale="frequency", breaks="Sturges",
col="darkgray"))
#Figure 8: Histogram of Free Sulfur Dioxide
with(White_wines, Hist(free.sulfur.dioxide, scale="frequency",
breaks="Sturges", col="darkgray"))
#Figure 9: Histogram of pH
with(White_wines, Hist(pH, scale="frequency", breaks="Sturges",
col="darkgray"))
#Figure 10: Histogram of Residual Sugar
with(White_wines, Hist(residual.sugar, scale="frequency", breaks="Sturges",
col="darkgray"))
#Figure 11: Histogram of Sulphates
with(White_wines, Hist(sulphates, scale="frequency", breaks="Sturges",
col="darkgray"))
#Figure 12: Histogram of Total Sulfur Dioxide
with(White_wines, Hist(total.sulfur.dioxide, scale="frequency",
breaks="Sturges", col="darkgray"))
#Figure 13: Histogram of Volatile Acidity
with(White_wines, Hist(volatile.acidity, scale="frequency",
breaks="Sturges", col="darkgray"))
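The visual impression of skew can be backed by a number. A minimal sketch in base R uses the moment-based sample skewness; the helper `skewness()` and the simulated vectors are illustrative and not part of the original analysis. Values near 0 suggest symmetry, and clearly positive values suggest a right skew.

```r
# Moment-based sample skewness: third central moment / (second central moment)^1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(2)
right_skewed <- rexp(1000)   # simulated stand-in for a right-skewed variable
symmetric    <- rnorm(1000)  # simulated stand-in for a roughly normal variable

round(c(right_skewed = skewness(right_skewed),
        symmetric    = skewness(symmetric)), 2)

# With the real data this could be applied column by column, e.g.
# sapply(White_wines, skewness)
```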
To begin investigating potential relationships, scatterplot matrices have been run below.
scatterplotMatrix(~alcohol+chlorides+citric.acid+quality, reg.line=FALSE,
smooth=FALSE, spread=FALSE, span=0.5, ellipse=FALSE, levels=c(.5, .9),
id.n=0, diagonal = 'density', data=White_wines)
#Figure 14: Scatterplot Matrix: quality, density, fixed acidity, free sulfur dioxide.
scatterplotMatrix(~density+fixed.acidity+free.sulfur.dioxide+quality,
reg.line=FALSE, smooth=FALSE, spread=FALSE, span=0.5, ellipse=FALSE,
levels=c(.5, .9), id.n=0, diagonal = 'density', data=White_wines)
scatterplotMatrix(~pH+quality+residual.sugar+sulphates, reg.line=FALSE,
smooth=FALSE, spread=FALSE, span=0.5, ellipse=FALSE, levels=c(.5, .9),
id.n=0, diagonal = 'density', data=White_wines)
scatterplotMatrix(~quality+total.sulfur.dioxide+volatile.acidity,
reg.line=FALSE, smooth=FALSE, spread=FALSE, span=0.5, ellipse=FALSE,
levels=c(.5, .9), id.n=0, diagonal = 'density', data=White_wines)
Linear Correlation analysis shows:
kable(cor(White_wines[,c("alcohol","chlorides","citric.acid","density",
"fixed.acidity","quality")], use="complete"))
| | alcohol | chlorides | citric.acid | density | fixed.acidity | quality |
|---|---|---|---|---|---|---|
| alcohol | 1.0000000 | -0.3601887 | -0.0757287 | -0.7801376 | -0.1208811 | 0.4355747 |
| chlorides | -0.3601887 | 1.0000000 | 0.1143644 | 0.2572113 | 0.0230856 | -0.2099344 |
| citric.acid | -0.0757287 | 0.1143644 | 1.0000000 | 0.1495026 | 0.2891807 | -0.0092091 |
| density | -0.7801376 | 0.2572113 | 0.1495026 | 1.0000000 | 0.2653310 | -0.3071233 |
| fixed.acidity | -0.1208811 | 0.0230856 | 0.2891807 | 0.2653310 | 1.0000000 | -0.1136628 |
| quality | 0.4355747 | -0.2099344 | -0.0092091 | -0.3071233 | -0.1136628 | 1.0000000 |
kable(cor(White_wines[,c("free.sulfur.dioxide","pH","quality","residual.sugar",
"sulphates")], use="complete"))
| | free.sulfur.dioxide | pH | quality | residual.sugar | sulphates |
|---|---|---|---|---|---|
| free.sulfur.dioxide | 1.0000000 | -0.0006178 | 0.0081581 | 0.2990984 | 0.0592172 |
| pH | -0.0006178 | 1.0000000 | 0.0994272 | -0.1941335 | 0.1559515 |
| quality | 0.0081581 | 0.0994272 | 1.0000000 | -0.0975768 | 0.0536779 |
| residual.sugar | 0.2990984 | -0.1941335 | -0.0975768 | 1.0000000 | -0.0266644 |
| sulphates | 0.0592172 | 0.1559515 | 0.0536779 | -0.0266644 | 1.0000000 |
kable(cor(White_wines[,c("quality","total.sulfur.dioxide","volatile.acidity")],
use="complete"))
| | quality | total.sulfur.dioxide | volatile.acidity |
|---|---|---|---|
| quality | 1.0000000 | -0.1747372 | -0.1947230 |
| total.sulfur.dioxide | -0.1747372 | 1.0000000 | 0.0892605 |
| volatile.acidity | -0.1947230 | 0.0892605 | 1.0000000 |
#Table 2: Correlation of Independent Variables With Wine Quality

| Variable | Correlation with quality |
|---|---|
| alcohol | 0.435574715 |
| chlorides | -0.209934411 |
| citric.acid | -0.009209091 |
| density | -0.307123313 |
| fixed.acidity | -0.113662831 |
| free.sulfur.dioxide | 0.008158067 |
| pH | 0.099427246 |
| residual.sugar | -0.097576829 |
| sulphates | 0.053677877 |
| total.sulfur.dioxide | -0.1747372 |
| volatile.acidity | -0.1947230 |
Alcohol has the strongest relationship with quality, a moderate positive correlation (r = 0.44). Density (r = -0.31), chlorides (r = -0.21), volatile acidity (r = -0.19), and total sulfur dioxide (r = -0.17) have the strongest negative correlations with quality, although all of these are weak.
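Table 2 can also be produced in a single call rather than assembled from three separate correlation matrices. The sketch below uses a small simulated data frame; `wines` and its coefficients are made up for illustration, and with the real data `wines` would be `White_wines`.

```r
# Simulated stand-in data (assumption: two predictors plus a quality score)
set.seed(3)
n <- 200
wines <- data.frame(alcohol   = rnorm(n, 10.5, 1.2),
                    chlorides = rnorm(n, 0.045, 0.02))
wines$quality <- round(6 + 0.4 * (wines$alcohol - 10.5) -
                         10 * (wines$chlorides - 0.045) + rnorm(n, sd = 0.8))

# Correlate every predictor column with quality in one pass
predictors <- setdiff(names(wines), "quality")
cor_with_quality <- sapply(wines[predictors], cor,
                           y = wines$quality, use = "complete.obs")

# Sort so the strongest positive and negative relationships are easy to spot
sort(cor_with_quality, decreasing = TRUE)
```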
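To further investigate potential relationships between quality and the individual variables, simple linear regressions have been run below. Note that each model is specified with the chemical variable as the response and quality as the predictor (for example, alcohol ~ quality), so the coefficients describe how each variable changes with quality rather than the reverse.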
Regression Model Alcohol and quality.
RegModel.Alcohol <- lm(alcohol~quality, data=White_wines)
summary(RegModel.Alcohol)
##
## Call:
## lm(formula = alcohol ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2986 -0.7882 -0.1382 0.8014 4.1223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.95670 0.10626 65.47 <2e-16 ***
## quality 0.60524 0.01788 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.108 on 4896 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1896
## F-statistic: 1146 on 1 and 4896 DF, p-value: < 2.2e-16
Regression Model fixed.acidity and quality.
RegModel.fixed.acidity <- lm(fixed.acidity~quality, data=White_wines)
summary(RegModel.fixed.acidity)
##
## Call:
## lm(formula = fixed.acidity ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0416 -0.5499 -0.0499 0.4667 7.3584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.49138 0.08042 93.152 < 2e-16 ***
## quality -0.10830 0.01353 -8.005 1.48e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8385 on 4896 degrees of freedom
## Multiple R-squared: 0.01292, Adjusted R-squared: 0.01272
## F-statistic: 64.08 on 1 and 4896 DF, p-value: 1.48e-15
Regression Model volatile.acidity and quality.
RegModel.volatile.acidity <- lm(volatile.acidity~quality, data=White_wines)
summary(RegModel.volatile.acidity)
##
## Call:
## lm(formula = volatile.acidity ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20986 -0.06554 -0.01554 0.04446 0.78014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.408504 0.009483 43.08 <2e-16 ***
## quality -0.022161 0.001595 -13.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09888 on 4896 degrees of freedom
## Multiple R-squared: 0.03792, Adjusted R-squared: 0.03772
## F-statistic: 193 on 1 and 4896 DF, p-value: < 2.2e-16
Regression Model citric.acid and quality.
RegModel.citric.acid <- lm(citric.acid~quality, data=White_wines)
summary(RegModel.citric.acid)
##
## Call:
## lm(formula = citric.acid ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3366 -0.0653 -0.0153 0.0547 1.3260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.341588 0.011608 29.427 <2e-16 ***
## quality -0.001258 0.001953 -0.644 0.519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.121 on 4896 degrees of freedom
## Multiple R-squared: 8.481e-05, Adjusted R-squared: -0.0001194
## F-statistic: 0.4153 on 1 and 4896 DF, p-value: 0.5193
Regression Model residual.sugar and quality.
RegModel.residual.sugar <- lm(residual.sugar~quality, data=White_wines)
summary(RegModel.residual.sugar)
##
## Call:
## lm(formula = residual.sugar ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.300 -4.482 -1.023 3.412 59.477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.67613 0.48420 19.98 < 2e-16 ***
## quality -0.55882 0.08146 -6.86 7.72e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.048 on 4896 degrees of freedom
## Multiple R-squared: 0.009521, Adjusted R-squared: 0.009319
## F-statistic: 47.06 on 1 and 4896 DF, p-value: 7.724e-12
Regression Model chlorides and quality.
RegModel.chlorides <- lm(chlorides~quality, data=White_wines)
summary(RegModel.chlorides)
##
## Call:
## lm(formula = chlorides ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.042498 -0.009319 -0.003140 0.003860 0.295681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0762135 0.0020490 37.20 <2e-16 ***
## quality -0.0051789 0.0003447 -15.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02136 on 4896 degrees of freedom
## Multiple R-squared: 0.04407, Adjusted R-squared: 0.04388
## F-statistic: 225.7 on 1 and 4896 DF, p-value: < 2.2e-16
Regression Model free.sulfur.dioxide and quality.
RegModel.free.sulfur.dioxide <- lm(free.sulfur.dioxide~quality, data=White_wines)
summary(RegModel.free.sulfur.dioxide)
##
## Call:
## lm(formula = free.sulfur.dioxide ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.171 -12.171 -1.484 10.516 254.143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.3872 1.6313 21.080 <2e-16 ***
## quality 0.1567 0.2744 0.571 0.568
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.01 on 4896 degrees of freedom
## Multiple R-squared: 6.655e-05, Adjusted R-squared: -0.0001377
## F-statistic: 0.3259 on 1 and 4896 DF, p-value: 0.5681
Regression Model total.sulfur.dioxide and quality.
RegModel.total.sulfur.dioxide <- lm(total.sulfur.dioxide~quality, data=White_wines)
summary(RegModel.total.sulfur.dioxide)
##
## Call:
## lm(formula = total.sulfur.dioxide ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.107 -28.722 -2.337 28.278 277.508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 187.6464 4.0138 46.75 <2e-16 ***
## quality -8.3849 0.6752 -12.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.85 on 4896 degrees of freedom
## Multiple R-squared: 0.03053, Adjusted R-squared: 0.03034
## F-statistic: 154.2 on 1 and 4896 DF, p-value: < 2.2e-16
Regression Model density and quality.
RegModel.density <- lm(density~quality, data=White_wines)
summary(RegModel.density)
##
## Call:
## lm(formula = density ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.007718 -0.002104 -0.000361 0.001859 0.045079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+00 2.730e-04 3663.07 <2e-16 ***
## quality -1.037e-03 4.593e-05 -22.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002847 on 4896 degrees of freedom
## Multiple R-squared: 0.09432, Adjusted R-squared: 0.09414
## F-statistic: 509.9 on 1 and 4896 DF, p-value: < 2.2e-16
Regression Model pH and quality.
RegModel.pH <- lm(pH~quality, data=White_wines)
summary(RegModel.pH)
##
## Call:
## lm(formula = pH ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47034 -0.10034 -0.01034 0.08966 0.61966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.088623 0.014413 214.301 < 2e-16 ***
## quality 0.016952 0.002425 6.992 3.08e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1503 on 4896 degrees of freedom
## Multiple R-squared: 0.009886, Adjusted R-squared: 0.009684
## F-statistic: 48.88 on 1 and 4896 DF, p-value: 3.081e-12
Regression Model sulphates and quality.
RegModel.sulphates <- lm(sulphates~quality, data=White_wines)
summary(RegModel.sulphates)
##
## Call:
## lm(formula = sulphates ~ quality, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27761 -0.08069 -0.01377 0.05931 0.58239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.449189 0.010931 41.092 < 2e-16 ***
## quality 0.006917 0.001839 3.761 0.000171 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.114 on 4896 degrees of freedom
## Multiple R-squared: 0.002881, Adjusted R-squared: 0.002678
## F-statistic: 14.15 on 1 and 4896 DF, p-value: 0.000171
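From these models we see that neither citric acid (p = 0.519) nor free sulfur dioxide (p = 0.568) appears to have a significant linear relationship with quality. Every other variable has a statistically significant slope, although most explain very little variance (R-squared below 0.05 for all but alcohol and density).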
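Each model above is written with quality on the right-hand side. For a simple regression this does not change the test of association: the slope's t statistic and R-squared are identical when the roles are swapped, and R-squared equals the squared correlation (for alcohol, 0.4355747^2 is about 0.1897, matching the output above). A small check on simulated data, where the variable names and coefficients are illustrative only:

```r
set.seed(4)
n <- 300
alcohol <- rnorm(n, 10.5, 1.2)                    # simulated predictor
quality <- 2 + 0.35 * alcohol + rnorm(n, sd = 0.8) # simulated response

fit_xy <- lm(alcohol ~ quality)  # direction used in the models above
fit_yx <- lm(quality ~ alcohol)  # direction matching the stated dependent variable

# Slope t statistics from the two fits
t_xy <- coef(summary(fit_xy))["quality", "t value"]
t_yx <- coef(summary(fit_yx))["alcohol", "t value"]

c(t_xy = t_xy, t_yx = t_yx,
  r2_xy = summary(fit_xy)$r.squared,
  r2_yx = summary(fit_yx)$r.squared,
  cor_sq = cor(alcohol, quality)^2)
```

The slope estimates themselves do differ between the two directions, so coefficients from the models above should not be read as the change in quality per unit of each variable.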
Now we will begin building a model using multiple regressions. However, prior to building the model, we will first split our dataset into a training and testing set.
set.seed(20170214) #Random Number seed is the date
White_wines$group <- runif(length(White_wines$quality), min = 0, max = 1) #create a new variable to add to dataset to distribute random numbers from 0-1
White_wines.train <- subset(White_wines, group <= 0.90) #assign 90% of the data to the training set
White_wines.test <- subset(White_wines, group > 0.90) #assign remaining data to the test set
#Did it work?
summary(White_wines.train)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.100
## Mean : 6.851 Mean :0.2784 Mean :0.3337 Mean : 6.342
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.800
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 3.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04574 Mean : 35.28 Mean :138.3
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.72 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.09 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.18 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.19 Mean :0.4892 Mean :10.52
## 3rd Qu.:0.9960 3rd Qu.:3.28 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.82 Max. :1.0800 Max. :14.20
## quality group
## Min. :3.000 Min. :0.0002833
## 1st Qu.:5.000 1st Qu.:0.2285282
## Median :6.000 Median :0.4596618
## Mean :5.879 Mean :0.4570277
## 3rd Qu.:6.000 3rd Qu.:0.6859608
## Max. :9.000 Max. :0.8998507
summary(White_wines.test)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 5.000 Min. :0.0800 Min. :0.0000 Min. : 0.800
## 1st Qu.: 6.400 1st Qu.:0.2175 1st Qu.:0.2600 1st Qu.: 2.100
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 6.300
## Mean : 6.889 Mean :0.2766 Mean :0.3387 Mean : 6.866
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.:10.400
## Max. :10.200 Max. :1.0050 Max. :0.8800 Max. :22.000
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01400 Min. : 2.0 Min. : 24.0
## 1st Qu.:0.03675 1st Qu.: 23.0 1st Qu.:108.0
## Median :0.04300 Median : 35.0 Median :135.0
## Mean :0.04612 Mean : 35.6 Mean :139.4
## 3rd Qu.:0.05000 3rd Qu.: 47.0 3rd Qu.:170.2
## Max. :0.20400 Max. :124.0 Max. :260.0
## density pH sulphates alcohol
## Min. :0.9877 Min. :2.770 Min. :0.280 Min. : 8.40
## 1st Qu.:0.9918 1st Qu.:3.080 1st Qu.:0.400 1st Qu.: 9.40
## Median :0.9941 Median :3.170 Median :0.480 Median :10.20
## Mean :0.9942 Mean :3.174 Mean :0.496 Mean :10.45
## 3rd Qu.:0.9964 3rd Qu.:3.260 3rd Qu.:0.560 3rd Qu.:11.30
## Max. :1.0010 Max. :3.690 Max. :1.010 Max. :13.90
## quality group
## Min. :3.000 Min. :0.9001
## 1st Qu.:5.000 1st Qu.:0.9229
## Median :6.000 Median :0.9528
## Mean :5.872 Mean :0.9506
## 3rd Qu.:6.000 3rd Qu.:0.9758
## Max. :8.000 Max. :0.9993
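Because `runif()` assigns each row independently, the split above is only approximately 90/10 (the training set has 4,438 of the 4,898 rows, about 90.6%). If an exact proportion is wanted, one common alternative is to sample row indices. This is a sketch of that alternative on a made-up data frame, not the method used in this analysis:

```r
set.seed(20170214)

# Hypothetical stand-in data frame; replace with White_wines
wines <- data.frame(id = 1:1000,
                    quality = sample(3:9, 1000, replace = TRUE))

# Draw exactly 90% of the row indices for training
n_train   <- floor(0.9 * nrow(wines))
train_idx <- sample(nrow(wines), n_train)

train <- wines[train_idx, ]
test  <- wines[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```

This also avoids adding a helper column such as `group` to the data, which would otherwise need to be excluded from any formula using `.`.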
Now we will begin with a full model including all variables.
LinearModel.Full <- lm(quality ~ alcohol + chlorides + citric.acid +
density + fixed.acidity + free.sulfur.dioxide + pH + residual.sugar
+ sulphates + total.sulfur.dioxide + volatile.acidity,
data=White_wines.train)
summary(LinearModel.Full)
##
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density +
## fixed.acidity + free.sulfur.dioxide + pH + residual.sugar +
## sulphates + total.sulfur.dioxide + volatile.acidity, data = White_wines.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8642 -0.4973 -0.0362 0.4704 3.0782
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.552e+02 1.937e+01 8.013 1.42e-15 ***
## alcohol 1.885e-01 2.510e-02 7.510 7.10e-14 ***
## chlorides -2.444e-01 5.701e-01 -0.429 0.668114
## citric.acid 4.294e-02 1.010e-01 0.425 0.670887
## density -1.555e+02 1.965e+01 -7.917 3.06e-15 ***
## fixed.acidity 8.103e-02 2.176e-02 3.724 0.000199 ***
## free.sulfur.dioxide 4.064e-03 8.870e-04 4.581 4.74e-06 ***
## pH 7.268e-01 1.099e-01 6.614 4.19e-11 ***
## residual.sugar 8.492e-02 7.816e-03 10.865 < 2e-16 ***
## sulphates 6.578e-01 1.068e-01 6.156 8.10e-10 ***
## total.sulfur.dioxide -4.434e-04 3.963e-04 -1.119 0.263311
## volatile.acidity -1.822e+00 1.199e-01 -15.199 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7538 on 4426 degrees of freedom
## Multiple R-squared: 0.2805, Adjusted R-squared: 0.2787
## F-statistic: 156.9 on 11 and 4426 DF, p-value: < 2.2e-16
The full model explains 28% of the variability in quality (R-squared = 0.2805). The F statistic is 156.9 and is highly significant (p < 2.2e-16). We will investigate what occurs as this model is reduced.
To continue, we will use backward elimination and remove all variables that were not significant in the full model (chlorides, citric acid, and total sulfur dioxide).
Reduced Model 1 (LinearModel.2) will include alcohol, density, fixed acidity, free sulfur dioxide, pH, residual sugar, sulphates, and volatile acidity.
LinearModel.2 <- lm(quality ~ alcohol + density + fixed.acidity +
free.sulfur.dioxide + pH + residual.sugar + sulphates + volatile.acidity,
data=White_wines.train)
summary(LinearModel.2)
##
## Call:
## lm(formula = quality ~ alcohol + density + fixed.acidity + free.sulfur.dioxide +
## pH + residual.sugar + sulphates + volatile.acidity, data = White_wines.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8536 -0.4930 -0.0388 0.4675 3.0889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.598e+02 1.872e+01 8.535 < 2e-16 ***
## alcohol 1.888e-01 2.495e-02 7.566 4.64e-14 ***
## density -1.603e+02 1.898e+01 -8.445 < 2e-16 ***
## fixed.acidity 8.386e-02 2.133e-02 3.931 8.58e-05 ***
## free.sulfur.dioxide 3.487e-03 7.137e-04 4.885 1.07e-06 ***
## pH 7.325e-01 1.078e-01 6.792 1.25e-11 ***
## residual.sugar 8.639e-02 7.594e-03 11.377 < 2e-16 ***
## sulphates 6.524e-01 1.064e-01 6.130 9.57e-10 ***
## volatile.acidity -1.861e+00 1.152e-01 -16.150 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7537 on 4429 degrees of freedom
## Multiple R-squared: 0.2802, Adjusted R-squared: 0.2789
## F-statistic: 215.5 on 8 and 4429 DF, p-value: < 2.2e-16
This reduced model still explains 28% of the variability in quality (R-squared = 0.2802). The F statistic increased to 215.5 and is highly significant. Next we examine diagnostics for this model before reducing it further.
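Dropping several predictors at once can be checked with a partial F-test, which compares the reduced model to the full model directly. A hedged sketch on simulated data follows; the data frame `d` and its variables are hypothetical. A large p-value indicates the dropped terms add little.

```r
set.seed(5)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                noise1 = rnorm(n), noise2 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 - 0.8 * d$x2 + rnorm(n)  # noise1/noise2 have no true effect

full    <- lm(y ~ x1 + x2 + noise1 + noise2, data = d)
reduced <- update(full, . ~ . - noise1 - noise2)

# Partial F-test: does the full model fit significantly better than the reduced one?
cmp <- anova(reduced, full)
cmp
```

With the models in this analysis the analogous call would be `anova(LinearModel.2, LinearModel.Full)`.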
View influential observations.
#Figure 17: Influential Observations for Model 2
#added variable plots
avPlots(LinearModel.2, id.n=2, id.cex=0.7)
#id.n - number of most unusual observations to label in each plot
#id.cex - controls the size of the point labels
# run the qq-plot
qqPlot(LinearModel.2, id.n=3)
## 4746 254 2782
## 1 2 4438
# here, id.n identifies the n observations with the largest residuals in absolute value
# residual plots and curvature tests for Model 2 (8 predictors)
residualPlots(LinearModel.2)
## Test stat Pr(>|t|)
## alcohol 5.191 0.000
## density 5.552 0.000
## fixed.acidity -4.163 0.000
## free.sulfur.dioxide -10.160 0.000
## pH 0.880 0.379
## residual.sugar 2.520 0.012
## sulphates 0.729 0.466
## volatile.acidity 3.184 0.001
## Tukey test 2.551 0.011
#run Bonferroni test for outliers
outlierTest(LinearModel.2)
## rstudent unadjusted p-value Bonferonni p
## 4746 -5.285819 1.3116e-07 0.00058211
## 2782 4.931011 8.4800e-07 0.00376340
## 254 -4.496908 7.0712e-06 0.03138200
## 446 -4.485892 7.4449e-06 0.03304100
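The Bonferroni p-value reported above is the unadjusted two-sided p-value of the studentized residual multiplied by the number of observations (for observation 4746, 1.3116e-07 * 4438 is about 0.00058). The sketch below reproduces that calculation in base R on simulated data with one planted outlier. It is intended to mirror what `outlierTest()` reports, not to reproduce its source.

```r
set.seed(6)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[10] <- y[10] + 8          # plant one gross outlier (simulated data)
fit <- lm(y ~ x)

# Studentized residuals and their two-sided t-test p-values
t_i     <- rstudent(fit)
p_unadj <- 2 * pt(abs(t_i), df = df.residual(fit) - 1, lower.tail = FALSE)

# Bonferroni adjustment: multiply by the number of tests, cap at 1
p_bonf <- pmin(1, n * p_unadj)

flagged <- which(p_bonf < 0.05)
data.frame(obs = flagged, rstudent = t_i[flagged], bonferroni_p = p_bonf[flagged])
```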
#make influence plot
influencePlot(LinearModel.2, id.n=3)
## StudRes Hat CookD
## 254 -4.4969082 0.002555952 0.005732828
## 1527 -0.6554449 0.038237083 0.001898028
## 1932 -3.7826585 0.015025299 0.024179469
## 2782 4.9310113 0.351726593 1.458130190
## 4746 -5.2858195 0.058668586 0.192314295
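In the table above, observation 2782 stands out with a hat value of 0.35 and a Cook's distance of 1.46, far larger than any other point. One way to gauge how much such a point matters is to refit without it and compare coefficients. The following is a sketch on simulated data with a planted high-leverage point; the 4/n cutoff is a common rule of thumb, not something this analysis specifies.

```r
set.seed(9)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[1] <- 8; y[1] <- -10      # planted high-leverage point far from the true line
fit <- lm(y ~ x)

# Base R equivalents of the columns shown by influencePlot()
infl <- data.frame(stud_res = rstudent(fit),
                   hat      = hatvalues(fit),
                   cook_d   = cooks.distance(fit))
infl[infl$cook_d > 4 / n, ]

# Refit without the suspicious row and compare the slope estimates
refit <- update(fit, subset = -1)
rbind(with_point = coef(fit), without_point = coef(refit))
```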
#test for heteroskedasticity
ncvTest(LinearModel.2) #tests for non constant variance.
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 16.04371 Df = 1 p = 6.189685e-05
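The small p-value above indicates that the residual variance changes with the fitted values. As a sketch of how a score test of this kind (Breusch-Pagan / Cook-Weisberg) can be computed by hand, the block below regresses scaled squared residuals on the fitted values using simulated heteroskedastic data. It reflects the textbook form of the test and is an assumption about, not a copy of, what `ncvTest()` does internally.

```r
set.seed(7)
n <- 500
x <- runif(n, 1, 5)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)  # simulated: error spread grows with x
fit <- lm(y ~ x)

# Scale squared residuals by the ML variance estimate (RSS / n)
e <- residuals(fit)
u <- e^2 / (sum(e^2) / n)

# Regress the scaled squared residuals on the fitted values;
# the score statistic is half the regression sum of squares
aux     <- lm(u ~ fitted(fit))
reg_ss  <- sum((fitted(aux) - mean(u))^2)
chisq   <- reg_ss / 2
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
c(chisq = chisq, p = p_value)

# The p-value in the output above follows from its chi-square value on 1 df
pchisq(16.04371, df = 1, lower.tail = FALSE)
```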
vif(LinearModel.2)
## alcohol density fixed.acidity
## 7.337053 25.278419 2.545179
## free.sulfur.dioxide pH residual.sugar
## 1.151922 2.083100 11.620123
## sulphates volatile.acidity
## 1.126483 1.060061
#rule of thumb used here: a VIF above 4 suggests the variable is highly correlated with other predictors and is a candidate for removal
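A variance inflation factor is 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others. The sketch below computes it by hand in base R on simulated predictors; the names echo the wine variables, but the data and the strength of the alcohol-density link are made up.

```r
set.seed(8)
n <- 300
alcohol <- rnorm(n)
density <- -0.8 * alcohol + rnorm(n, sd = 0.3)  # simulated: strongly tied to alcohol
pH      <- rnorm(n)                             # simulated: unrelated to the others
d <- data.frame(alcohol, density, pH)

# For each predictor, regress it on the remaining predictors and convert R^2 to a VIF
vif_manual <- sapply(names(d), function(v) {
  others <- setdiff(names(d), v)
  r2 <- summary(lm(reformulate(others, response = v), data = d))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

The same values appear on the diagonal of the inverse correlation matrix, `diag(solve(cor(d)))`, which is a quick cross-check.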
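Based on the previous plots and diagnostics we further reduce the model. Density has by far the largest VIF (25.3) and is strongly correlated with alcohol (r = -0.78), so Model 3 removes density only. Free sulfur dioxide and residual sugar are reconsidered in later models.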
LinearModel.3 <- lm(quality ~ alcohol + fixed.acidity +
free.sulfur.dioxide + pH + residual.sugar + sulphates + volatile.acidity,
data=White_wines.train)
summary(LinearModel.3)
##
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity + free.sulfur.dioxide +
## pH + residual.sugar + sulphates + volatile.acidity, data = White_wines.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8938 -0.4962 -0.0333 0.4624 3.1774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7101754 0.3536637 4.836 1.37e-06 ***
## alcohol 0.3800365 0.0105722 35.947 < 2e-16 ***
## fixed.acidity -0.0451858 0.0150020 -3.012 0.002610 **
## free.sulfur.dioxide 0.0037010 0.0007189 5.148 2.74e-07 ***
## pH 0.1719520 0.0856708 2.007 0.044797 *
## residual.sugar 0.0261723 0.0026316 9.946 < 2e-16 ***
## sulphates 0.3989868 0.1029242 3.877 0.000107 ***
## volatile.acidity -2.0134448 0.1146832 -17.557 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7597 on 4430 degrees of freedom
## Multiple R-squared: 0.2686, Adjusted R-squared: 0.2675
## F-statistic: 232.5 on 7 and 4430 DF, p-value: < 2.2e-16
This reduced model still explains 27% of the variability in quality (R-squared = 0.2686), slightly less than Model 2. The F statistic increased to 232.5 and is still highly significant.
#Figure 23: Residual Plots for Model 3
# residual plots and curvature tests for Model 3 (7 predictors)
residualPlots(LinearModel.3)
## Test stat Pr(>|t|)
## alcohol 5.243 0.000
## fixed.acidity -3.584 0.000
## free.sulfur.dioxide -10.370 0.000
## pH 0.386 0.700
## residual.sugar -2.049 0.041
## sulphates 0.878 0.380
## volatile.acidity 1.968 0.049
## Tukey test 0.145 0.884
We will investigate what occurs as this model is further reduced by removing free sulfur dioxide.
LinearModel.4 <- lm(quality ~ alcohol + fixed.acidity + pH + residual.sugar + sulphates + volatile.acidity,
data=White_wines.train)
summary(LinearModel.4)
##
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity + pH + residual.sugar +
## sulphates + volatile.acidity, data = White_wines.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4043 -0.4962 -0.0369 0.4662 3.1503
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.914453 0.352441 5.432 5.87e-08 ***
## alcohol 0.373244 0.010520 35.481 < 2e-16 ***
## fixed.acidity -0.051461 0.014995 -3.432 0.000605 ***
## pH 0.178963 0.085906 2.083 0.037286 *
## residual.sugar 0.029424 0.002562 11.485 < 2e-16 ***
## sulphates 0.432488 0.103013 4.198 2.74e-05 ***
## volatile.acidity -2.080385 0.114271 -18.206 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7618 on 4431 degrees of freedom
## Multiple R-squared: 0.2643, Adjusted R-squared: 0.2633
## F-statistic: 265.3 on 6 and 4431 DF, p-value: < 2.2e-16
This reduced model still explains 26% of the variability in quality (R-squared = 0.2643). The F statistic increased to 265.3 and is still highly significant.
# residual plots and curvature tests for Model 4 (6 predictors)
residualPlots(LinearModel.4)
## Test stat Pr(>|t|)
## alcohol 5.496 0.000
## fixed.acidity -3.795 0.000
## pH 0.165 0.869
## residual.sugar -2.603 0.009
## sulphates 1.054 0.292
## volatile.acidity 1.760 0.078
## Tukey test -0.203 0.839
vif(LinearModel.4)
## alcohol fixed.acidity pH residual.sugar
## 1.276125 1.230980 1.293699 1.294518
## sulphates volatile.acidity
## 1.032790 1.020648
#all VIFs are well below 4, so multicollinearity is no longer a concern
We next remove pH, the least significant remaining predictor (p = 0.037 in Model 4).
LinearModel.5 <- lm(quality ~ alcohol + fixed.acidity + residual.sugar + sulphates + volatile.acidity,
data=White_wines.train)
summary(LinearModel.5)
##
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity + residual.sugar +
## sulphates + volatile.acidity, data = White_wines.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3580 -0.4939 -0.0352 0.4642 3.1857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.560289 0.167709 15.27 < 2e-16 ***
## alcohol 0.373620 0.010522 35.51 < 2e-16 ***
## fixed.acidity -0.064489 0.013634 -4.73 2.31e-06 ***
## residual.sugar 0.028655 0.002536 11.30 < 2e-16 ***
## sulphates 0.468359 0.101603 4.61 4.15e-06 ***
## volatile.acidity -2.088819 0.114243 -18.28 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7621 on 4432 degrees of freedom
## Multiple R-squared: 0.2635, Adjusted R-squared: 0.2627
## F-statistic: 317.2 on 5 and 4432 DF, p-value: < 2.2e-16
This reduced model still explains 26% of the variability in quality (R-squared = 0.2635). The F statistic increased to 317.2 and is still highly significant. The adjusted R squared is essentially unchanged (0.2627, versus 0.2633 for the previous model), so removing pH costs almost nothing in fit and gives a simpler model.
# residual plots and curvature tests for Model 5 (5 predictors)
residualPlots(LinearModel.5)
## Test stat Pr(>|t|)
## alcohol 5.143 0.000
## fixed.acidity -3.568 0.000
## residual.sugar -2.491 0.013
## sulphates 0.929 0.353
## volatile.acidity 1.930 0.054
## Tukey test -0.471 0.637
vif(LinearModel.5)
## alcohol fixed.acidity residual.sugar sulphates
## 1.275751 1.016852 1.267666 1.003936
## volatile.acidity
## 1.019367
#all VIFs are close to 1, so multicollinearity is not a concern in this model
Based on the residuals I would like to see what happens when residual sugar and fixed acidity are removed from the model.
LinearModel.6 <- lm(quality ~ alcohol + sulphates + volatile.acidity,
data=White_wines.train)
summary(LinearModel.6)
##
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity,
## data = White_wines.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3158 -0.4886 -0.0468 0.4947 3.1571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.785383 0.116676 23.873 < 2e-16 ***
## alcohol 0.325036 0.009493 34.239 < 2e-16 ***
## sulphates 0.434956 0.103148 4.217 2.53e-05 ***
## volatile.acidity -1.936946 0.115352 -16.792 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7744 on 4434 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2388
## F-statistic: 465 on 3 and 4434 DF, p-value: < 2.2e-16
This model does not appear to fit better than the previous model. It accounts for only 24% of the variability in quality and, while still significant, its adjusted R squared has decreased from 0.26 to 0.24.
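The adjusted R-squared values compared across these models follow from R-squared, the sample size n, and the number of predictors p: adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below recomputes them from the printed summaries; `adj_r2()` is a helper written for this illustration, and agreement is limited to about the fourth decimal because the printed R-squared values are rounded.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Training-set size: residual df + predictors + intercept, from the full-model output
n_train <- 4426 + 11 + 1

# R-squared, predictor counts, and reported adjusted values copied from the summaries above
models <- data.frame(
  model    = c("Full", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6"),
  r2       = c(0.2805, 0.2802, 0.2686, 0.2643, 0.2635, 0.2393),
  p        = c(11, 8, 7, 6, 5, 3),
  reported = c(0.2787, 0.2789, 0.2675, 0.2633, 0.2627, 0.2388)
)
models$recomputed <- round(adj_r2(models$r2, n_train, models$p), 4)
models
```

Laid out side by side, the adjusted values decline only slightly from Model 2 through Model 5 and then drop more noticeably at Model 6.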
compareCoefs(LinearModel.2, LinearModel.3, LinearModel.4, LinearModel.5,
LinearModel.Full)
##
## Call:
## 1: lm(formula = quality ~ alcohol + density + fixed.acidity +
## free.sulfur.dioxide + pH + residual.sugar + sulphates +
## volatile.acidity, data = White_wines.train)
## 2: lm(formula = quality ~ alcohol + fixed.acidity +
## free.sulfur.dioxide + pH + residual.sugar + sulphates +
## volatile.acidity, data = White_wines.train)
## 3: lm(formula = quality ~ alcohol + fixed.acidity + pH +
## residual.sugar + sulphates + volatile.acidity, data =
## White_wines.train)
## 4: lm(formula = quality ~ alcohol + fixed.acidity + residual.sugar +
## sulphates + volatile.acidity, data = White_wines.train)
## 5: lm(formula = quality ~ alcohol + chlorides + citric.acid + density
## + fixed.acidity + free.sulfur.dioxide + pH + residual.sugar +
## sulphates + total.sulfur.dioxide + volatile.acidity, data =
## White_wines.train)
## Est. 1 SE 1 Est. 2 SE 2 Est. 3
## (Intercept) 1.60e+02 1.87e+01 1.71e+00 3.54e-01 1.91e+00
## alcohol 1.89e-01 2.50e-02 3.80e-01 1.06e-02 3.73e-01
## density -1.60e+02 1.90e+01
## fixed.acidity 8.39e-02 2.13e-02 -4.52e-02 1.50e-02 -5.15e-02
## free.sulfur.dioxide 3.49e-03 7.14e-04 3.70e-03 7.19e-04
## pH 7.32e-01 1.08e-01 1.72e-01 8.57e-02 1.79e-01
## residual.sugar 8.64e-02 7.59e-03 2.62e-02 2.63e-03 2.94e-02
## sulphates 6.52e-01 1.06e-01 3.99e-01 1.03e-01 4.32e-01
## volatile.acidity -1.86e+00 1.15e-01 -2.01e+00 1.15e-01 -2.08e+00
## chlorides
## citric.acid
## total.sulfur.dioxide
## SE 3 Est. 4 SE 4 Est. 5 SE 5
## (Intercept) 3.52e-01 2.56e+00 1.68e-01 1.55e+02 1.94e+01
## alcohol 1.05e-02 3.74e-01 1.05e-02 1.88e-01 2.51e-02
## density -1.56e+02 1.96e+01
## fixed.acidity 1.50e-02 -6.45e-02 1.36e-02 8.10e-02 2.18e-02
## free.sulfur.dioxide 4.06e-03 8.87e-04
## pH 8.59e-02 7.27e-01 1.10e-01
## residual.sugar 2.56e-03 2.87e-02 2.54e-03 8.49e-02 7.82e-03
## sulphates 1.03e-01 4.68e-01 1.02e-01 6.58e-01 1.07e-01
## volatile.acidity 1.14e-01 -2.09e+00 1.14e-01 -1.82e+00 1.20e-01
## chlorides -2.44e-01 5.70e-01
## citric.acid 4.29e-02 1.01e-01
## total.sulfur.dioxide -4.43e-04 3.96e-04
# compare the results of the three regression models
stargazer(LinearModel.4, LinearModel.5, LinearModel.6, title="Comparison of Regression outputs", type="text", align=TRUE)
=================================================================================================
                                         Dependent variable:
                    -----------------------------------------------------------------------------
                                               quality
                               (1)                       (2)                       (3)
-------------------------------------------------------------------------------------------------
alcohol                      0.373***                  0.374***                  0.325***
                             (0.011)                   (0.011)                   (0.009)

fixed.acidity               -0.051***                 -0.064***
                             (0.015)                   (0.014)

pH                           0.179**
                             (0.086)

residual.sugar               0.029***                  0.029***
                             (0.003)                   (0.003)

sulphates                    0.432***                  0.468***                  0.435***
                             (0.103)                   (0.102)                   (0.103)

volatile.acidity            -2.080***                 -2.089***                 -1.937***
                             (0.114)                   (0.114)                   (0.115)

Constant                     1.914***                  2.560***                  2.785***
                             (0.352)                   (0.168)                   (0.117)

-------------------------------------------------------------------------------------------------
Observations                  4,438                     4,438                     4,438
R2                            0.264                     0.264                     0.239
Adjusted R2                   0.263                     0.263                     0.239
Residual Std. Error     0.762 (df = 4431)         0.762 (df = 4432)         0.774 (df = 4434)
F Statistic         265.260*** (df = 6; 4431) 317.205*** (df = 5; 4432) 465.047*** (df = 3; 4434)
=================================================================================================
Note:                                                                 *p<0.1; **p<0.05; ***p<0.01
# with type="html" the table is only visible after knitting to HTML; type="text" prints it in the console; the options are "html", "text", or "latex"
#test for heteroskedasticity
ncvTest(LinearModel.5) # tests for non-constant variance; the small p-value (5.2e-07) rejects constant variance, so the residuals are heteroskedastic
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 25.17967 Df = 1 p = 5.222981e-07
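The score test above can be reproduced in a few lines of base R, which helps when reading the result. The sketch below follows the Breusch-Pagan / Cook-Weisberg construction with the fitted values as the variance predictor: scaled squared residuals are regressed on the fitted values, and half the regression sum of squares is compared to a chi-square distribution with 1 degree of freedom. It is my reading of the test, not code taken from the `car` package, and it uses the built-in `mtcars` data as a stand-in for `White_wines.train`.

```r
# Stand-in model on mtcars (illustrative only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n  <- nobs(fit)
e2 <- residuals(fit)^2

# Scale squared residuals by the ML variance estimate (divides by n, not n - p)
u <- e2 / (sum(e2) / n)

# Auxiliary regression of the scaled squared residuals on the fitted values
aux <- lm(u ~ fitted(fit))

# Score statistic: half of the regression (explained) sum of squares
reg_ss  <- sum((fitted(aux) - mean(u))^2)
chisq   <- reg_ss / 2
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

cat("Chisquare =", chisq, " Df = 1  p =", p_value, "\n")
```

A small p-value means the spread of the residuals changes with the fitted values, which is the situation reported for `LinearModel.5`.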
vif(LinearModel.5)
## alcohol fixed.acidity residual.sugar sulphates
## 1.275751 1.016852 1.267666 1.003936
## volatile.acidity
## 1.019367
# a VIF above about 4 suggests a predictor is highly correlated with the others and is a candidate for removal; all values here are close to 1
#make influence plot
influencePlot(LinearModel.5, id.n=3)
## StudRes Hat CookD
## 254 -4.4170724 0.0008409785 0.002725574
## 446 -4.2524949 0.0009687925 0.002911503
## 741 -4.3592646 0.0013393889 0.004230614
## 1418 -3.5644279 0.0041885139 0.008883125
## 1527 0.6459469 0.0181910902 0.001288639
## 2051 -3.3219899 0.0081047808 0.014994727
## 2782 -0.8366248 0.0497611466 0.006109381
## 4040 -1.1217840 0.0154427311 0.003289463
## 4481 -3.0490524 0.0064404365 0.010025076
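The three columns of the influence table (studentized residual, hat value, Cook's distance) are all available from base R, so the flagged rows can be inspected directly. The sketch below uses `mtcars` as a stand-in, and the cutoffs (|StudRes| > 2, Hat > 2p/n, CookD > 4/n) are common rules of thumb that I am assuming for illustration; they are not necessarily the criteria `influencePlot()` uses to label points.

```r
# Stand-in model on mtcars (illustrative only)
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)
p <- length(coef(fit))  # number of estimated coefficients, including the intercept

diag_tbl <- data.frame(
  StudRes = rstudent(fit),        # externally studentized residuals
  Hat     = hatvalues(fit),       # leverage of each observation
  CookD   = cooks.distance(fit)   # overall influence on the fitted coefficients
)

# Rule-of-thumb cutoffs (assumptions, not taken from the report)
flagged <- diag_tbl[abs(diag_tbl$StudRes) > 2 |
                      diag_tbl$Hat > 2 * p / n |
                      diag_tbl$CookD > 4 / n, ]
print(flagged)
```

Observations with a large residual but low leverage (like rows 254, 446, and 741 above) are poorly predicted but barely move the coefficients, whereas high-leverage rows deserve a check for data-entry problems.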
Based on these results, I believe LinearModel.5 is the best model for this data. It currently accounts for 26% of the variability in the quality score. I am not pleased with the residual plots or the influential points, and I would also like to include fewer variables in the model. However, I am unsure how far these flaws can acceptably be traded off against the amount of variability the model accounts for.
We will now fit the same model specification to the testing dataset.
LinearModel.test <- lm(quality ~ alcohol + fixed.acidity + residual.sugar + sulphates + volatile.acidity,
data=White_wines.test)
summary(LinearModel.test)
##
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity + residual.sugar +
## sulphates + volatile.acidity, data = White_wines.test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.71904 -0.47984 -0.03971 0.46703 2.62025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.837360 0.501036 7.659 1.14e-13 ***
## alcohol 0.340385 0.031312 10.871 < 2e-16 ***
## fixed.acidity -0.165806 0.041642 -3.982 7.97e-05 ***
## residual.sugar 0.015319 0.007811 1.961 0.0505 .
## sulphates 0.248848 0.269160 0.925 0.3557
## volatile.acidity -2.203158 0.346791 -6.353 5.14e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7234 on 454 degrees of freedom
## Multiple R-squared: 0.3123, Adjusted R-squared: 0.3047
## F-statistic: 41.23 on 5 and 454 DF, p-value: < 2.2e-16
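The fit above re-estimates the coefficients on the test rows. A complementary check, not the approach used in this report, is to keep the coefficients learned on the training rows and score the held-out rows with `predict()`. The sketch below shows that pattern on the built-in `mtcars` data with an arbitrary 24/8 split (both are assumptions for illustration); with this report's objects the key call would be `predict(LinearModel.5, newdata = White_wines.test)`.

```r
# Stand-in train/test split on mtcars (illustrative only)
set.seed(42)
idx   <- sample(nrow(mtcars), size = 24)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Fit on the training rows only
fit_train <- lm(mpg ~ wt + hp, data = train)

# Score the held-out rows with the training coefficients
pred <- predict(fit_train, newdata = test)

# Out-of-sample error and R-squared (baseline: the training mean)
rmse   <- sqrt(mean((test$mpg - pred)^2))
r2_oos <- 1 - sum((test$mpg - pred)^2) / sum((test$mpg - mean(train$mpg))^2)

cat("Test RMSE:", rmse, " Out-of-sample R-squared:", r2_oos, "\n")
```

If the out-of-sample R-squared is close to the training R-squared, the model generalises about as well as the training fit suggests.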
Running this model on the test data provides a significant F statistic, with the model explaining 31% of the variability in the quality score. The model uses alcohol, fixed acidity, residual sugar, sulphates, and volatile acidity to explain quality. The equation for this model is:
Y = 3.84 + (0.34)x1 + (-0.17)x2 + (0.02)x3 + (0.25)x4 + (-2.20)x5 + E
Where: Y = quality, x1 = alcohol, x2 = fixed acidity, x3 = residual sugar, x4 = sulphates, x5 = volatile acidity, E = error
Using this model, volatile acidity appears to influence quality the most. Holding the other variables constant, a 1-unit increase in volatile acidity is associated with a 2.20-point decrease in the wine's quality score.
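As a worked example of reading the fitted equation, the sketch below plugs one wine into the coefficients copied from the `summary(LinearModel.test)` output. The predictor values are the medians from `summary(White_wines)` at the top of this document, chosen only for illustration, and `predict_quality` is a hypothetical helper name.

```r
# Coefficients copied from the summary(LinearModel.test) output
b <- c(intercept        =  3.837360,
       alcohol          =  0.340385,
       fixed.acidity    = -0.165806,
       residual.sugar   =  0.015319,
       sulphates        =  0.248848,
       volatile.acidity = -2.203158)

# Illustrative wine: median values from summary(White_wines)
wine <- c(alcohol = 10.40, fixed.acidity = 6.8, residual.sugar = 5.2,
          sulphates = 0.47, volatile.acidity = 0.26)

# Hypothetical helper: intercept plus the sum of coefficient * value
predict_quality <- function(x) unname(b["intercept"] + sum(b[names(x)] * x))

base_pred <- predict_quality(wine)

# Raise volatile acidity by 0.1 while holding everything else constant
wine_hi <- wine
wine_hi["volatile.acidity"] <- wine["volatile.acidity"] + 0.1
delta <- predict_quality(wine_hi) - base_pred

cat("Predicted quality:", round(base_pred, 2), "\n")
cat("Change for +0.1 volatile acidity:", round(delta, 2), "\n")
```

Volatile acidity only ranges from 0.08 to 1.10 in this data, so a realistic step of 0.1 moves the predicted score by about 0.22 points rather than the full 2.20 implied by a 1-unit change.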
Fitting the model to the full dataset uses alcohol, fixed acidity, residual sugar, sulphates, and volatile acidity to explain 27% of the variability in the wine's quality score. The usefulness of the model can be investigated using the diagnostic tests and plots below.
LinearModel.testfull <- lm(quality ~ alcohol + fixed.acidity + residual.sugar + sulphates + volatile.acidity,
data=White_wines)
summary(LinearModel.testfull)
##
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity + residual.sugar +
## sulphates + volatile.acidity, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3693 -0.4939 -0.0341 0.4634 3.2090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.671921 0.158935 16.811 < 2e-16 ***
## alcohol 0.370957 0.009972 37.199 < 2e-16 ***
## fixed.acidity -0.073315 0.012958 -5.658 1.62e-08 ***
## residual.sugar 0.027559 0.002412 11.427 < 2e-16 ***
## sulphates 0.443294 0.095155 4.659 3.27e-06 ***
## volatile.acidity -2.102798 0.108510 -19.379 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7588 on 4892 degrees of freedom
## Multiple R-squared: 0.2667, Adjusted R-squared: 0.266
## F-statistic: 355.9 on 5 and 4892 DF, p-value: < 2.2e-16
# diagnostics for the full-data model with 5 independent variables
residualPlots(LinearModel.testfull)
## Test stat Pr(>|t|)
## alcohol 5.264 0.000
## fixed.acidity -3.765 0.000
## residual.sugar -2.360 0.018
## sulphates 0.432 0.666
## volatile.acidity 2.534 0.011
## Tukey test -0.652 0.514
#added variable plots
avPlots(LinearModel.testfull, id.n=2, id.cex=0.7)
#id.n - label the n most extreme observations so that potential outliers can be picked out by row number
#id.cex - controls the size of the point labels
# run the qq-plot
qqPlot(LinearModel.testfull, id.n=3)
## 254 741 446
## 1 2 3
# here, id.n identifies the n observations with the largest residuals in absolute value
#run Bonferroni test for outliers
outlierTest(LinearModel.testfull)
## rstudent unadjusted p-value Bonferonni p
## 254 -4.450691 8.7497e-06 0.042856
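The Bonferroni p-value in this output is the unadjusted p-value multiplied by the number of observations tested (capped at 1), which can be checked by hand. The sketch below reproduces it from the printed numbers; the observation count is inferred from the 4892 residual degrees of freedom plus the 6 estimated coefficients.

```r
# Values copied from the outlierTest output above
p_unadjusted <- 8.7497e-06

# 4892 residual degrees of freedom + 6 estimated coefficients
n_obs <- 4892 + 6

# Bonferroni correction: multiply by the number of tests, cap at 1
p_bonferroni <- min(1, p_unadjusted * n_obs)

# The same adjustment via base R's p.adjust()
p_bonferroni_alt <- p.adjust(p_unadjusted, method = "bonferroni", n = n_obs)

cat("Bonferroni p:", signif(p_bonferroni, 5), "\n")
```

Because the adjusted value is just under 0.05, observation 254 is only marginally flagged as an outlier once the nearly 4,900 implicit comparisons are accounted for.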
#identify highly influential points
influenceIndexPlot(LinearModel.testfull, id.n=3)
#make influence plot
influencePlot(LinearModel.testfull, id.n=3)
## StudRes Hat CookD
## 254 -4.4506905 0.0007658563 2.520676e-03
## 446 -4.2595629 0.0008827298 2.662385e-03
## 741 -4.3741368 0.0012254319 3.898059e-03
## 1418 -3.5339258 0.0037893386 7.898727e-03
## 1527 0.7303460 0.0165631880 1.497425e-03
## 1952 0.1694621 0.0147172029 7.150637e-05
## 2051 -3.2721210 0.0073798698 1.324074e-02
## 2782 -0.7153193 0.0454303839 4.059110e-03
## 4481 -3.0466504 0.0058563248 9.097779e-03
#test for heteroskedasticity
ncvTest(LinearModel.testfull) # tests for non-constant variance; the small p-value (2.6e-07) rejects constant variance, so the residuals are heteroskedastic
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 26.55193 Df = 1 p = 2.565479e-07
vif(LinearModel.testfull)
## alcohol fixed.acidity residual.sugar sulphates
## 1.280977 1.017086 1.272816 1.003092
## volatile.acidity
## 1.017465
# a VIF above about 4 suggests a predictor is highly correlated with the others and is a candidate for removal; all values here are close to 1
Running this model on the full dataset provides a significant F statistic, with the model explaining 27% of the variability in the quality score. The model uses alcohol, fixed acidity, residual sugar, sulphates, and volatile acidity to explain quality. The equation for this model is:
Y = 2.67 + (0.37)x1 + (-0.07)x2 + (0.03)x3 + (0.44)x4 + (-2.10)x5 + E
Where: Y = quality, x1 = alcohol, x2 = fixed acidity, x3 = residual sugar, x4 = sulphates, x5 = volatile acidity, E = error
Using this model, volatile acidity still appears to have the largest influence on wine quality. Holding the other variables constant, a 1-unit increase in volatile acidity is associated with a 2.10-point decrease in the wine's quality score. This agrees with how wine is tested for spoilage, as volatile acidity is traditionally used as a measure of wine spoilage. Volatile acidity measures a variety of byproducts that have accumulated in wine, including acetic, lactic, formic, butyric, and propionic acids. There are legal limits on the volatile acidity allowed in a batch of wine. Levels higher than the legal amount indicate the wine has over-fermented (Neeley, 2004).
Louis Pasteur sought to discover the cause of alcohol spoiling in 1857 and, as a result, identified acetic acid-producing bacteria as the culprit. Because these bacteria are aerobic, spoilage occurs faster in the presence of oxygen, which is why many tools exist to vacuum the oxygen out of a bottle of wine. The findings within this dataset appear to agree with the findings of Louis Pasteur.
Neeley, E. (2004). Volatile Acidity. Waterhouse Lab: UC Davis. Retrieved from http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity
The rmarkdown used to create this file can be found at https://github.com/amonda/Regression-1. The file is named New.Rmd.